White Wine Analysis

Greg Hein 5/6/2016


The original data set under consideration contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. In addition, at least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

The guiding question for this analysis is:

Which chemical properties influence the quality of white wines?

Further information regarding the dataset is available at this link.


To get an initial look at the wine data set, I will look at the variable names, structure, and summary.
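In R, this kind of first look typically comes from three base functions: `names()`, `str()`, and `summary()`. The sketch below is illustrative only. The data frame `wine_sample` is a hypothetical stand-in built from the first five rows of four columns shown in the structure listing of this report; it is not the full data set, and it is not necessarily the code that produced the output that follows.

```r
# Illustrative stand-in: first five rows of four columns, copied from the
# str() listing in this report (the real data set has 4,898 rows).
wine_sample <- data.frame(
  fixed.acidity  = c(7.0, 6.3, 8.1, 7.2, 7.2),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality        = c(6L, 6L, 6L, 6L, 6L)
)

names(wine_sample)    # column names
str(wine_sample)      # type and leading values of each column
summary(wine_sample)  # min, quartiles, median, mean and max per column
```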


Variable names

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "bound.sulfur.dioxide" "quality.level"

Data Structure

## 'data.frame':    4898 obs. of  15 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ bound.sulfur.dioxide: num  125 118 67 139 139 67 106 125 118 101 ...
##  $ quality.level       : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...

Data Summary

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##                                                                        
##     alcohol         quality      bound.sulfur.dioxide quality.level
##  Min.   : 8.00   Min.   :3.000   Min.   :  4.0        3:  20       
##  1st Qu.: 9.50   1st Qu.:5.000   1st Qu.: 78.0        4: 163       
##  Median :10.40   Median :6.000   Median :100.0        5:1457       
##  Mean   :10.51   Mean   :5.878   Mean   :103.1        6:2198       
##  3rd Qu.:11.40   3rd Qu.:6.000   3rd Qu.:125.0        7: 880       
##  Max.   :14.20   Max.   :9.000   Max.   :331.0        8: 175       
##                                                       9:   5

This is a good starting point, but also a lot of numbers to digest. In the next section I will begin to plot the data in order to visualize it.


Univariate Plots Section


Wine Quality Bar Chart and Table

Because quality is our primary focus in this analysis, an important question to answer is: what is its distribution amongst the wines? A bar chart of quality will help visualize this. Also, to answer the question of exactly how many wines are in each quality bin, a table follows the chart.

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

It can be seen that the vast majority of wines fall in the 5-7 range. Specifically, there are 4535 wines in the 5-7 range which is 92.59% of all the wines. The median of Quality is 6, and the mean is 5.878. The smallest bin is the highest quality wines at 9, of which there are only 5 (0.1%).
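The figures quoted in this paragraph can be reproduced from the table alone. The following is a minimal sketch that uses only the counts printed above; the variable names are illustrative.

```r
# Counts per quality score, copied from the table above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

total_wines <- sum(quality_counts)                    # all wines in the table
mid_range   <- sum(quality_counts[c("5", "6", "7")])  # wines rated 5 to 7
mid_share   <- round(100 * mid_range / total_wines, 2)

# Rebuild the individual scores from the counts to get the median and mean.
scores <- rep(as.integer(names(quality_counts)), times = quality_counts)
quality_median <- median(scores)
quality_mean   <- round(mean(scores), 3)

print(c(total = total_wines, mid_range = mid_range, mid_share = mid_share))
print(c(median = quality_median, mean = quality_mean))
```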


Histograms of all chemical properties of the data:

Similar to the look at the distribution of quality above, this section looks at the distributions of the chemical properties.

## 
## Fixed Acidity Median & Mean: 6.8 & 6.855

## 
## Volatile Acidity Median & Mean: 0.26 & 0.278

## 
## Citric Acid Median & Mean: 0.32 & 0.334

## 
## pH Median & Mean: 3.18 & 3.188

## 
## Free Sulfur Dioxide Median & Mean: 34 & 35.308

## 
## Bound Sulfur Dioxide Median & Mean: 100 & 103.053

## 
## Total Sulfur Dioxide Median & Mean: 134 & 138.361

## 
## Sulfates Median & Mean: 0.47 & 0.490

## 
## Chlorides Median & Mean: 0.043 & 0.046

## 
## Density Median & Mean: 0.99374 & 0.994

Although none of the above chemical properties has exactly the same median and mean, they appear to have a relatively normal distribution.


## 
## Residual Sugar Median & Mean: 5.2 & 6.391

There appears to be a large number of wines in the lowest residual sugar bin. The distribution also appears to show a positive skew.


## 
## Alcohol Median & Mean: 10.4 & 10.514

The alcohol distribution appears relatively flat, except that it contains one unusually large bin.


For all of the chemical properties, the median is smaller than the mean.
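This claim can be checked directly against the medians and means printed above. The sketch below copies those twelve pairs into a data frame (the name `stats` and the `gap_pct` column are illustrative) and compares them.

```r
# Median and mean of each chemical property, copied from the output above.
stats <- data.frame(
  property = c("fixed.acidity", "volatile.acidity", "citric.acid", "pH",
               "free.sulfur.dioxide", "bound.sulfur.dioxide",
               "total.sulfur.dioxide", "sulphates", "chlorides", "density",
               "residual.sugar", "alcohol"),
  median = c(6.8, 0.26, 0.32, 3.18, 34, 100, 134, 0.47, 0.043, 0.99374,
             5.2, 10.4),
  mean   = c(6.855, 0.278, 0.334, 3.188, 35.308, 103.053, 138.361, 0.490,
             0.046, 0.994, 6.391, 10.514)
)

# Relative gap between mean and median, as a percentage of the median.
stats$gap_pct <- round(100 * (stats$mean - stats$median) / stats$median, 1)

all(stats$median < stats$mean)  # TRUE: every mean exceeds its median
stats[order(-stats$gap_pct), c("property", "gap_pct")]
```

Residual sugar shows the largest relative gap (about 23% of its median), which is in line with the positive skew noted for its histogram.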


Univariate Analysis


What is the structure of your dataset?

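This data set contains 4,898 white wines described by 15 columns: a row index (X), 12 variables quantifying the chemical properties of each wine (the 11 original measurements plus the derived bound sulfur dioxide), and 2 others reporting the subjective quality of each wine.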

What is/are the main feature(s) of interest in your dataset?

The main feature is wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

All of the other features that are chemical properties of the wines may help support the investigation into wine quality. That is the question under investigation.

Did you create any new variables from existing variables in the dataset?

Although I have no prior knowledge that it will have any effect, I created a new variable for bound sulfur dioxide. It seemed an obvious variable to create, with free and total sulfur dioxide already present in the data.

I also created another variable, “quality level”, which is simply quality as a factor. This was done to make certain graphs easier to create going forward.
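A minimal sketch of how these two variables could be derived is shown below. It assumes that bound sulfur dioxide is total minus free sulfur dioxide, which reproduces the `bound.sulfur.dioxide` values in the structure listing of this report. The data frame `wine_head` is an illustrative stand-in that holds only the first ten rows from that listing.

```r
# First ten rows of three columns, copied from the str() listing in this report.
wine_head <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  quality              = c(6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L)
)

# Assumption: bound SO2 is the part of total SO2 that is not free.
wine_head$bound.sulfur.dioxide <-
  wine_head$total.sulfur.dioxide - wine_head$free.sulfur.dioxide

# Quality as a factor; levels 3 to 9 are the scores present in the full data.
wine_head$quality.level <- factor(wine_head$quality, levels = 3:9)

wine_head$bound.sulfur.dioxide
str(wine_head$quality.level)
```

Setting the levels explicitly keeps all seven quality scores in the factor even when, as in this ten-row excerpt, only the score 6 is present.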

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Residual sugar appears to have many values in the lowest bin.

According to Wikipedia:

“Even among the driest wines, it is rare to find wines with a level of less than 1 g/L, due to the unfermentability of certain types of sugars, such as pentose.”

It was determined how many of the wines measured residual sugar of less than one:

## [1] 15

Histogram of wines with residual sugar >= one:

This histogram is similar to the original, so no data were removed from the set.

Residual.sugar skews positive, while most other histograms resembled a somewhat normal distribution. To gain further perspective a log transformation was done on residual sugar:

The log scale for residual sugar looks somewhat bimodal.


The alcohol histogram looked somewhat flat, so log and square root transformations were done.

Log transformation:

Square root transformation:

After the transformations, the alcohol histogram is still somewhat flat, although it gives the impression of a positive skew.


Now that we have had a look at the variables in the data set individually, the next step will be to begin to look at how the variables relate to each other.


Bivariate Plots Section


For an initial quick overview of the relationships between the variables in the data set, a correlation matrix is a good starting point.

Correlation Matrix

## The variables shown in the correlation matrix below are in 
##      the following order:
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "bound.sulfur.dioxide"


To get more detail, and begin to answer the question of how the chemical properties of the wines relate to quality, the next section will look at scatter plots comparing the chemical properties with quality and the correlation between each chemical property and quality.


Scatter Plots of Chemical Properties vs. Quality with Correlations

## Correlation of Quality and Fixed Acidity: -0.1136628

## Correlation of Quality and Volatile Acidity: -0.194723

## Correlation of Quality and Citric Acid: -0.009209091

There appears to be little to no relationship between citric acid and quality.


## Correlation of Quality and Residual Sugar: -0.09757683

## Correlation of Quality and Chlorides: -0.2099344

## Correlation of Quality and Free Sulfur Dioxide: 0.008158067

There appears to be little to no relationship between free sulfur dioxide and quality.


## Correlation of Quality and Total Sulfur Dioxide: -0.1747372

## Correlation of Quality and Bound Sulfur Dioxide: -0.2178678

## Correlation of Quality and Density: -0.3071233

The relationship of density and quality shows the largest negative correlation. It can also be seen above that there are weaker but still noticeable negative correlations between quality and chlorides, bound sulfur dioxide, and total sulfur dioxide.


## Correlation of Quality and pH: 0.09942725

## Correlation of Quality and Sulphates: 0.05367788

## Correlation of Quality and Alcohol: 0.4355747

Alcohol’s correlation with quality was the one with the highest magnitude.
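To compare the correlations above on the same footing, they can be ranked by absolute value. The sketch below uses only the coefficients printed in this section. The helper `cor_with_quality()` is hypothetical and shows one way such values could be computed from a data frame shaped like the wine data; it relies on `cor()`, whose default is the Pearson correlation, and that this report used the default is an assumption.

```r
# Correlations with quality, copied from the output above.
quality_cor <- c(
  fixed.acidity        = -0.1136628,
  volatile.acidity     = -0.194723,
  citric.acid          = -0.009209091,
  residual.sugar       = -0.09757683,
  chlorides            = -0.2099344,
  free.sulfur.dioxide  =  0.008158067,
  total.sulfur.dioxide = -0.1747372,
  bound.sulfur.dioxide = -0.2178678,
  density              = -0.3071233,
  pH                   =  0.09942725,
  sulphates            =  0.05367788,
  alcohol              =  0.4355747
)

# Rank by absolute value so that negative and positive relationships
# are compared by strength rather than by sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(ranked, 3))

# Hypothetical helper: correlation of every other numeric column with
# `quality` in a data frame that has a numeric `quality` column.
cor_with_quality <- function(df) {
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  predictors <- setdiff(numeric_cols, "quality")
  sapply(df[predictors], function(col) cor(col, df$quality))
}
```

The ranking puts alcohol first, followed by density, bound sulfur dioxide, and chlorides. If the helper were applied to the full data frame, an index column such as `X` would need to be dropped first, since it is numeric but not a chemical property.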


To get a better look at the distribution of the chemical properties in each quality bin, box plots are shown next. These box plots will be a nice visual of some of the numerical data presented above in the “Data Summary” section.


Box Plots of Chemical Properties vs. Quality


In the previous section of scatter plots, the citric acid/quality correlation line appeared flat. In looking at the box plot there is a suggestion that the highest (9 rated) quality wines may have a bit more citric acid. It must be kept in mind that there are only 5 wines in that bin.


The quality/alcohol box plot appears to suggest that the positive correlation between quality and alcohol shown in the scatter plot section may not hold for the lowest (3 & 4 rated) quality wines.


Bivariate Analysis


Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There were 4 different measures of acidity in the data: fixed acidity, volatile acidity, citric acid, and pH. In all cases, quality was inversely correlated with measures of acidity (lower pH readings mean higher acidity), although the correlation with citric acid was close to zero. The strongest negative correlations with quality were with chlorides, bound sulfur dioxide, and density. The most significant positive correlation with wine quality was alcohol.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The two strongest correlations amongst the chemical properties of the wines were between bound sulfur dioxide and total sulfur dioxide (0.9224823), and between residual sugar and density (0.8389665).

What was the strongest relationship you found?

The strongest relationship with quality was a positive one with alcohol.


In the next section, I will incorporate another variable into the plots, specifically to look at the question of how combinations of chemical properties relate to quality.


Multivariate Plots Section


The next 6 plots are scatter plots of pairs of chemical properties plotted against each other. To add the third dimension to the analysis, the color of the data points reflects the quality: lower quality wines have lighter colors, higher quality wines have darker colors.


It appears the lightest (lowest quality) area of the plot is where alcohol is low, and bound sulfur dioxide is high.



This plot shows the strong negative relationship between alcohol and density, with the low density/high alcohol wines in general having greater quality than high density/low alcohol wines.



It is interesting in the plot of free sulfur dioxide and citric acid to see that in the area where free sulfur dioxide is greater than 120, 8 of the 20 wines with a 3 rating appear, and all wines in this region are of lower quality.



Although the above 3 plots do not seem to give us any new information, they do reinforce the previous correlations found in the bivariate section.


The final multivariate plot shows two of the chemical properties faceted by quality. It shows an interesting relationship discussed below.



Multivariate Analysis


Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

For this section I chose a number of pairs of chemical properties and looked at how their relationship to each other changed with respect to the main feature of quality. Some of the observations showed no obvious changes with respect to quality (e.g. citric acid vs. free sulfur dioxide). Other observations reinforced some of the previous bivariate observations, for example the relationship between chlorides and quality. Looking at the two plots of alcohol vs. chlorides and citric acid vs. chlorides above, one can see that the highest quality wines have low chlorides compared with the lower quality wines. Also, in the plots that include alcohol, it is clear that higher quality wines tend to have higher alcohol.

Were there any interesting or surprising interactions between features?

I thought the most interesting interaction was seen in the plot of sulphates and bound sulfur dioxide faceted by quality. In the lower quality wines the relationship seems to slope upwards, suggesting a positive relationship. As the wine quality increases the slope appears to flatten, and at the highest qualities it looks to be flat, suggesting no relationship between sulphates and bound sulfur dioxide.


Final Plots and Summary


Plot One

Description One

The guiding question for this analysis is: Which chemical properties influence the quality of white wines? The variable that correlated most strongly with quality was alcohol. This can be seen in the above scatter plots. The linear regression line clearly slopes upward. As shown earlier, the correlation coefficient is 0.4355747. The smoothed fit curve adds more information. It can be seen that the positive relationship does not appear to occur below a quality rating of 5. Above 5, the relationship is clear.


Plot Two

Description Two

Another chemical property that showed a strong relationship with quality was density, and the relationship was a negative one. This can be seen in the above histogram. The higher quality wines are seen in much larger proportion amongst the lower densities in the plot. One interesting observation in the histogram is that the lowest quality wines are in general distributed relatively evenly around the center.


Plot Three

Description Three

This plot was chosen as an interesting observation and extension of plot 2. As seen in plot 2 (and the previous bivariate analysis), higher quality wines in general have lower densities. Plot 3 shows that too. It also incorporates residual sugar, and it shows an interesting trend: the low density/high quality wines also appear to be the wines with higher residual sugar.


Reflection


With 4,898 observations, this is a relatively large data set. With 11 variables quantifying the chemical properties of each wine, there seemed to be plenty of measurements to analyze. As opposed to the chemical measurements of the wines in the data set, wine quality was an entirely subjective measure. One could possibly criticize the data for that reason. Personally, I feel that there would appear to be no other way to measure quality than subjectively. The quality ratings were said to be the “median of at least 3 evaluations made by wine experts”. Perhaps future data can be collected with more than 3 evaluations per wine.

The guiding question of this analysis was which chemical properties of white wines have an effect on wine quality. Both by determining correlations and by observing box and scatter plots, it was shown that alcohol had the strongest relationship with quality. Other important determinants were found to have a negative relationship with quality, specifically density, chlorides, and bound sulfur dioxide.

I felt that it was important throughout the analysis to keep in mind the focus on the guiding question. For that reason, in the bivariate section all the chemical properties were plotted against quality, and in the multivariate sections I chose to analyze pairs of chemical properties with respect to quality. In hindsight, I feel those were good choices, and the analysis was better due to that focus.

I did have some difficulty in the multivariate section of the analysis. I tried to analyze many different pairs of chemical properties with respect to quality (some not shown in final analysis), but seemed to not find many interesting relationships. I also felt it important to have a multivariate plot in the final plots, still keeping in mind that it needed to focus on quality. It took some time to find a plot that was of interest and furthered the analysis. In the end, some interesting multivariate trends were found.

Were I to conduct any future analysis, it would be interesting to loosen the focus on quality and learn more about the relationships amongst the many chemical properties.

This analysis was also an opportunity to learn the tools of the R programming language, specifically the ggplot2 package as a means to analyze a large set of data. Those tools proved to be a good resource for a data analyst to conduct an analysis. Of possible greater importance, it also appears to be an effective way to communicate those findings to others.